Abstract

Background: Bacterial insertion sequences (IS) of the IS200/IS605 and IS607 families often encode a transposase (TnpA)
and a protein of unknown function, TnpB.

Results: Here we report two groups of TnpB-like proteins (Fanzor1 and Fanzor2) that are widespread in diverse
eukaryotic transposable elements (TEs), and in large double-stranded DNA (dsDNA) viruses infecting eukaryotes.
Fanzor and TnpB proteins share the same conserved amino acid motif in their C-terminal halves:
D-X(125,275)-[TS]-[TS]-X-X-[C4 zinc finger]-X(5,50)-RD, but are highly variable in their N-terminal regions. Fanzor1 proteins are
frequently captured by DNA transposons from different superfamilies including Helitron, Mariner, IS4-like, Sola and
MuDr. In contrast, Fanzor2 proteins appear only in some IS607-type elements. We also analyze a new Helitron2
group from the Helitron superfamily, which contains elements with hairpin structures on both ends. Non-autonomous
Helitron2 elements (CRe-1, 2, 3) in the genome of the green alga Chlamydomonas reinhardtii are flanked by
target site duplications (TSDs) of variable length (approximately 7 to 19 bp).

Conclusions: The phylogeny and distribution of the TnpB/Fanzor proteins indicate that they may be disseminated 
among eukaryotic species by viruses. We hypothesize that TnpB/Fanzor proteins may act as methyltransferases. 

Keywords: DNA transposon, TnpB, Fanzor, Helitron, Helitron2, IS200/605, IS607, Methyltransferase



Background 

Transposable elements (TEs) are DNA segments that are 
duplicated and inserted into genomic DNA by a variety of 
mechanisms. There are two major groups of TEs: DNA 
transposons and retrotransposons. Retrotransposons are
further divided into those containing long terminal 
repeats (LTRs), or LTR retrotransposons, and non-LTR 
retrotransposons, which are not flanked by LTRs. Typically, 
TEs encode only proteins essential for their reproduc- 
tion and insertion, including reverse transcriptases and 
transposases (Tpases). Currently, there are four known 
types of transposases encoded by TEs. The most common 
type is the DDE-transposase encoded by most bacterial in- 
sertion sequences (IS), eukaryotic DNA transposons, and 
LTR retrotransposons. The second group is represented by 
reverse transcriptases (RT), encoded by a variety of non-LTR
and LTR-retrotransposons. The third group includes
tyrosine recombinases (YR) encoded by IS91 [1], Helitron
[2], IS200/IS605 [1], Crypton [3], and DIRS-retrotransposon
families [4,5]. The last group is represented by serine
recombinases (SR), encoded by IS607 family, Tn4451, and 
bacteriophage phiC31 [6]. The structural features and spe- 
cific transposition mechanisms differ fundamentally among 
these TE groups. Most DNA transposons are flanked by 
terminal inverted repeats (TIRs) and target site duplications 
(TSDs), and are transposed by the 'cut-and-paste' mechanism
used by DDE transposases, although some use a replicative
mechanism (Tn3) [7], or are able to switch to a
replicative mode (for example, MuDr, Tn7 and IS903
[8-11]). LTR-retrotransposons use RT and integrase (DDE- 
transposase) to complete their transposition. Non-LTR 
retrotransposons need both RT and endonuclease (EN) in 
their transposition process termed target site-primed re- 
verse transcription (TPRT) [12]. Transposons using YR and 
SR as Tpase lack TIRs and produce no TSDs upon inser- 
tion. However, their terminal hairpin structures (IS200/605 
family) or terminal short direct repeats (Crypton) are
important for transposition [3,13,14]. 

Elements from the IS200/IS605 and IS607 families 
usually encode a secondary protein (TnpB) of unknown 
function, in addition to transposase (TnpA). Three independent
experiments on IS607, ISHp608, and ISDra2
elements (the latter two belong to the IS200/IS605
family), have shown that TnpB is dispensable for the 
transposition in Escherichia coli [14,15] and Deinococcus 
radiodurans [16]. Interestingly, numerous IS elements 
(for example, IS1341, IS809 and IS1136) encode TnpB 
as the only protein (putative transposase), but the 
supporting evidence for TnpB-mediated transposition is 
still missing. Like other elements from the IS200/IS605 
and IS607 families, these TnpB-only transposons lack 
TIRs and TSDs. One possibility is that these elements 
represent non-autonomous derivatives of IS607 or IS200/IS605-like
transposons, where TnpA is deleted. Due to this
uncertainty, most of the TnpB-only elements are ambigu- 
ously assigned to the IS200/IS605 family in the ISfinder 
database (http://www-is.biotoul.fr) [17]. 

In this paper, we report two groups of TnpB-like proteins,
named Fanzor1 and Fanzor2 (collectively called
Fanzor), from diverse eukaryotic genomes, including
metazoans, fungi, and protists (amoeba, chlorophyte,
stramenopile, choanoflagellate and rhodophyte), as well
as dsDNA viruses that infect eukaryotes. Fanzor and
TnpB proteins both contain a constellation of strictly
conserved residues stretching from the protein center to
the C-terminus, D-X(125,275)-[TS]-[TS]-X-X-[C4 zinc
finger]-X(5,50)-RD. The C4 zinc finger is called
OrfB_Zn_ribbon ([CDD:pfam07282]) in the Conserved
Domain Database (CDD) [18]. Phylogenetically, Fanzor1
proteins form a single separate clade, and Fanzor2 proteins
co-cluster with a small set of bacterial TnpB proteins from
the IS607 family. Fanzor1 proteins were captured by transposable
elements from at least five different superfamilies:
Mariner, Sola, IS4, Helitron and MuDr. Fanzor2 proteins
are encoded by the IS607-type transposons. While the biological
function of the Fanzor/TnpB proteins is not known
at present, there are indications that the Fanzor1 protein
may function as a methyltransferase. This is based on a
comparison of three elements, PGv-1, Mariner-2_PGv and
Mariner-1_OLpv, each encoding three proteins, including
Mariner-Tpase, endonuclease and either methyltransferase
or Fanzor1 protein. Our data also suggest that viruses may
facilitate the spread of Fanzor proteins in eukaryotes.

The analysis of Fanzor proteins also revealed 'one-ended
transposition' in three non-autonomous Helitron
transposon families (CRe-1, 2, 3) in the green alga Chlamydomonas
reinhardtii. Of particular interest is the 'one-ended'
group of Helitrons flanked by TSDs. One-ended
transposition has been previously reported in the IS91 family
in bacteria [19,20], but not in association with the generation
of TSDs [20]. Finally, we describe a new Helitron group
(Helitron2) that is distinct from the canonical Helitron
elements (Helitron1). Helitron1 elements contain only
one hairpin structure, at the 3'-subterminal region, and
have conserved 5'-TC and CTRR-3' ends [2]. In contrast,
Helitron2 elements carry two hairpin structures and
short (8 to 15 bp) asymmetric terminal inverted repeats
(ATIRs) at the ends. The 5'-ATIR is close to the 5'-terminus,
pairing with its downstream nucleotides to
form a 5'-hairpin structure; the 3'-ATIR is subterminally
located, immediately upstream from the hairpin structure.
Individual Helitron2-like elements were reported to
differ from the canonical Helitron1 sequences in terms
of their terminal features [21-24]; however, the features
were not associated with any separate Helitron group.
The characteristic Helitron2 features may help improve
the performance of the automatic detection programs
that currently use only the Helitron1 features
[25,26].

Results 

Identification of the eukaryotic TnpB-like proteins 

During a systematic screening of TEs, a prototype of the
eukaryotic TnpB-IS200/IS605-like protein was first discovered
in the genome of the fungus Spizellomyces
punctatus. This protein, called SPu-1-1p (633-aa), is
encoded by a single open reading frame (ORF) in the
SPu-1 element (approximately 2,100-bp long), flanked by
33-bp terminal inverted repeats (TIRs) and putative TA
target site duplications (TSDs). We identified 17 full-length
SPu-1 copies, approximately 92% identical to the
family consensus, including nine copies with intact
ORFs. The immediate homologues of SPu-1-1p were
found in some related eukaryotes, but distant homologues
were identified among TnpB proteins encoded by
bacterial insertion sequences (ISs) from the IS200/IS605
and IS607 families (approximately 15% identity over an
approximately 300 aa C-terminal region; Figure 1). To
date, we have identified dozens of SPu-1-1p homologues
in at least 26 diverse eukaryotic genomes, as well as in
18 large dsDNA virus species infecting eukaryotes
(Table 1). The 26 eukaryotes belong to 7 taxonomic
groups: metazoa, choanoflagellida, fungi, amoebozoa,
chlorophyta, rhodophyta, and stramenopiles (Table 1).
Hereafter, these eukaryotic TnpB-like proteins are referred
to as Fanzor proteins and their bacterial counterparts
are referred to as TnpB proteins. As expected, the
vast majority of the Fanzor proteins, if not all of them,
are encoded by TEs, which are collectively referred to as
Fanzor elements. Consensus sequences of these elements
were reconstructed whenever possible. Some elements
are flanked by TIRs but others display no TIRs at their
ends (see Additional file 1). In some Fanzor elements, a
bona fide transposase is encoded along with the Fanzor
protein (see Additional file 1). Therefore, based on the
presence of Tpase or other characteristic DNA features,
most Fanzor elements can be classified into different
superfamilies (see below). The corresponding DNA and
protein sequences are listed in Additional file 2 and
Additional file 3.

Figure 1 Motifs and alignments of Fanzor and TnpB proteins. Conserved amino acids and helix-turn-helix (HTH) domain are marked (above);
gray regions indicate the variable N-terminal halves. Numbers above the diagram refer to the residue position in SPu-1-1p or TnpB_IS608. Titles of
Fanzor1 proteins are shaded in the alignment.

Sequence feature and phylogeny of Fanzor proteins 

The N-terminal halves of the Fanzor and TnpB proteins
are highly diverged, but their C-terminal halves are relatively
conserved and include a strictly conserved amino
acid motif, D-X(125,275)-[TS]-[TS]-X-X-[C4 zinc finger]-X(5,50)-RD
(Figure 1, see Additional file 4). To
date, this long motif was found only in TnpB and Fanzor 
proteins, and it includes a short, previously characterized 
OrfB_Zn_ribbon domain ([CDD:pfam07282]). Given that 
Fanzor and TnpB are both associated with TEs, the 
shared motif strongly suggests that they are functional 
homologues, rather than unrelated proteins accidentally 
carrying the same domain. 
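
As an illustration only, the conserved motif can be written as a regular expression for scanning candidate protein sequences. The sketch below is not the search procedure used in this study. The sub-pattern chosen for the C4 zinc finger (two C-X2-C pairs separated by 10 to 40 residues) is an assumption, because the text names the finger (OrfB_Zn_ribbon) without giving its spacing; the names C4_FINGER, MOTIF and find_motif, and the toy sequence, are likewise hypothetical.

```python
import re

# ASSUMPTION: the spacing of the four cysteines is a guess for illustration;
# the source only names the domain (OrfB_Zn_ribbon, pfam07282).
C4_FINGER = r"C.{2}C.{10,40}C.{2}C"

MOTIF = re.compile("".join([
    r"D",            # conserved aspartate
    r".{125,275}",   # X(125,275)
    r"[TS][TS]",     # [TS]-[TS]
    r"..",           # X-X
    C4_FINGER,       # [C4 zinc finger], assumed form
    r".{5,50}",      # X(5,50)
    r"RD",           # conserved Arg-Asp pair
]))


def find_motif(protein):
    """Return (start, end) of the first motif match, or None if absent."""
    match = MOTIF.search(protein.upper())
    return match.span() if match else None


# Synthetic (not real) sequence constructed to contain the motif exactly once.
toy = ("M" * 10 + "D" + "A" * 130 + "TS" + "AA"
       + "CAAC" + "A" * 15 + "CAAC" + "A" * 10 + "RD" + "AAAAA")
print(find_motif(toy))
```

A pattern of this kind only flags candidates; distant homologues would still need to be confirmed by alignment, as done for the proteins in Figure 1.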

Fanzor proteins are divided into two distinct clades,
Fanzor1 and Fanzor2 (Figure 2), as indicated by the
phylogenetic tree based on nearly the entire sequence
length (see Additional file 4). The major Fanzor1 clade
consists exclusively of eukaryotic proteins. In contrast,
the minor Fanzor2 clade co-clusters with several TnpB
proteins from the prokaryotic IS607 family, such as the
ISArma1 element. The co-clustering of Fanzor2 and
TnpB is not caused by sequence contamination, because
multiple proteins are found in each category. Apart from
the few TnpB proteins co-clustered with the Fanzor2 clade,
all the other TnpB proteins cluster together as an outgroup.
Notably, virus-borne Fanzor proteins come from both
Fanzor clades (Figure 2). For example, two different
strains of one virus, Emiliania huxleyi virus 88 and
Emiliania huxleyi virus 99B1, carry the EHv88-1 element
from the Fanzor1 clade and the EHv99B1-1 element from
the Fanzor2 clade, respectively (see Additional file 1).
On the other hand, highly similar Fanzor proteins can
be found in viruses with completely different genomic
sequences. For example, the HVav-1 element is 88% identical
to HAmn-1 over the entire length. However, the
two hosting virus genomes ([GenBank:EF133465] and
[GenBank:EU730893], respectively) share no detectable
similarity at all.

A helix-turn-helix (HTH) domain ([CDD:pfam12323]:
HTH_OrfB_IS605) is present in the N-terminal regions
of some TnpB and Fanzor2 proteins (Figure 1), including
those encoded by IS607, IS891, ISArma1 and ISvAR158_1.
Given that the alignment in this local area is relatively well
conserved (see Additional file 4), this HTH domain is
presumably present in other TnpB proteins, but due to the
high sequence divergence, whether or not a comparable
HTH domain exists in Fanzor1 proteins could not be determined.
Two additional amino acids are also extremely
conserved in the Fanzor1 proteins (G500 and E536,
Figure 1). However, this may reflect a smaller divergence of
the Fanzor1 clade than that of the TnpB clade (Figure 2).

Fanzor1 protein in Tc/mariner elements

Some Fanzor1 elements, such as PGv-1 and PUl-1
(Figure 3, see Additional file 5), encode both the Fanzor1
protein and a Mariner-like Tpase. Other elements, such
as PUl-4, encode Fanzor1 proteins only but carry TIRs
identical to those of confirmed members of the Mariner family,
and all are flanked by TA TSDs, a hallmark of the
Mariner transposons (Figure 3). The most interesting
examples are four related, single-copy Mariner elements:
PGv-1, Mariner-2_PGv, Mariner-1_OLpv and
HMa-1. The four elements share significant sequence
similarity in their TIRs and 5'-terminal regions (approximately
78% identical, 1 kb long) coding for the Mariner
Tpases (Figure 3), but they differ in their 3' portions.
Nevertheless, two proteins encoded by the 3' portions of
the former three Mariner elements appear to be functionally
comparable. For example, the first of the two 3'
proteins (suffixed '2p') in PGv-1, Mariner-2_PGv
and Mariner-1_OLpv is an endonuclease, and the second protein
(suffixed '3p') in Mariner-2_PGv and Mariner-1_OLpv
is a methyltransferase. Specifically, PGv-1-2p (291-aa) contains
a GIY-YIG nuclease [27] domain ([CDD:cl15257]) at
its N-terminus (E-value = 4.44e-09; see Additional file 6).
Mariner-2_PGv-2p (256-aa) is annotated as a hypothetical
restriction endonuclease in the REBASE database (The
Restriction Enzyme Database) [28]. Mariner-1_OLpv-2p
(198-aa, [GenBank:ADX06147.1]) contains the C-terminal
catalytic domain of the restriction endonuclease EcoRII
([CDD:pfam09019]), which is well supported by the
sequence alignment despite the low score (E-value = 1.1)
in the CDD database (see Additional file 7). Mariner-2_PGv-3p
(459-aa, [GenBank:AET72984.1]) contains the methyltransferase
domain Methyltransf_26 ([CDD:pfam13659];
E-value: 9.61e-08), and Mariner-1_OLpv-3p (344-aa,
[GenBank:ADX06148.1]) contains the Cyt_C5_DNA_methylase domain
([CDD:cd00315]; E-value: 8.99e-72). Based on this
parallelism (Tpase, endonuclease and methyltransferase),
one possibility is that the third protein encoded by PGv-1
(that is, the Fanzor protein) is also a methyltransferase. Notably,
HMa-1 also might have originated from an unknown virus
despite the fact that it is found in the Hydra magnipapillata
contig sequence ([GenBank:ABRM01000004.1], 154-kb), because
the closest relatives of the multiple upstream and
downstream proteins flanking the HMa-1 element are also
viral proteins.

Table 1 Species harboring Fanzor sequences

Taxon/Group        Species/Strain name

Metazoa            Mayetiola destructor
                   Hydra magnipapillata
Choanoflagellida   Salpingoeca sp. (ATCC 50818)
Fungi              Spizellomyces punctatus
                   Rhizopus oryzae RA 99-880
                   Allomyces macrogynus ATCC 38327
                   Phycomyces blakesleeanus NRRL1555
                   Mucor circinelloides
                   Ashbya gossypii ATCC 10895
                   Eremothecium cymbalariae DBVPG#7215
                   Saccharomyces cerevisiae EC1118, Lalvin QA23
                   Torulaspora delbrueckii
Amoebozoa          Dictyostelium fasciculatum
                   Polysphondylium pallidum PN500
                   Acanthamoeba castellanii strain Neff
Chlorophyta        Volvox carteri
                   Chlamydomonas reinhardtii
                   Chlorella vulgaris strain NJ-7
Rhodophyta         Cyanidioschyzon merolae
Stramenopiles      Pythium ultimum
                   Nannochloropsis oceanica
                   Phytophthora sojae
                   Phytophthora capsici
                   Phytophthora ramorum
                   Albugo laibachii Nc14
                   Ectocarpus siliculosus
dsDNA virus        Ectocarpus siliculosus virus ([GenBank:AF204951], 335-kb)
                   Shrimp white spot syndrome virus ([GenBank:AF332093], 305-kb)
                   Helicoverpa armigera granulovirus ([GenBank:EU255577], 169-kb)
                   Helicoverpa armigera multiple nucleopolyhedrovirus ([GenBank:EU730893], 154-kb)
                   Pseudaletia unipuncta granulovirus ([GenBank:EU678671], 176-kb)
                   Spodoptera frugiperda ascovirus 1a ([GenBank:AM398843], 157-kb)
                   Heliothis virescens ascovirus 3e ([GenBank:EF133465], 186-kb)
                   Mamestra configurata nucleopolyhedrovirus B ([GenBank:AY126275], 158-kb)
                   Phaeocystis globosa virus 12T ([GenBank:HQ634147], 460-kb)
                   Emiliania huxleyi virus 88 ([GenBank:JF974310], 397-kb)
                   Emiliania huxleyi virus 99B1 ([GenBank:FN429076], 377-kb)
                   Acanthamoeba polyphaga mimivirus ([GenBank:AY653733], 1181-kb)
                   Acanthamoeba castellanii mamavirus ([GenBank:JF801956], 1192-kb)
                   Megavirus chiliensis ([GenBank:JN258408], 1259-kb)
                   Paramecium bursaria Chlorella virus AR158 ([GenBank:DQ491003], 345-kb)
                   Paramecium bursaria Chlorella virus NY2A ([GenBank:DQ491002], 369-kb); families: 2; element prefix: ISvNY2A
                   Cafeteria roenbergensis virus BV-PW1 ([GenBank:GU244497], 617-kb); families: 1; element prefix: CRv-1
                   Feldmannia species virus ([GenBank:NC_011183], 155-kb); families: 1; element prefix: FEsv-1

Fanzor1 protein in Helitron transposons

There are three Fanzor1 elements (CRe-1, 2, 3) in the
genome of the single-celled green alga Chlamydomonas
reinhardtii, which most likely represent non-autonomous
Helitron transposons (specifically, the Helitron2 group of transposons
described below). Their 5'-end 200-bp and 3'-end
50-bp sequences are highly similar (approximately 90%
and 70% identity, respectively) to those of verified Helitrons
(that is, Helitron-1_CRe, Helitron-1N1_CRe and Helitron-1N2_CRe;
Figure 4D, see Additional file 8A). The Fanzor1
proteins are encoded by five exons in the CRe-1 element, and
by ten exons in the CRe-2 and CRe-3 elements (Figure 4A).
These exons are supported by a number of expressed sequence
tags (ESTs).

The three Fanzor1 families (CRe-1, 2, 3) are frequently
5'-truncated, often in combination with internal deletions
(Figure 4A, 4F, see Additional file 8B). However, almost
all copies are intact at the 3'-terminal regions (Figure 4F).
This biased 3'-overabundance implies that the duplication
process by rolling-circle replication starts from the
3'-end, which is analogous to the previously reported
one-ended transposition in the bacterial IS91 element [20].
Data from Helitron-1N1_CRe and Helitron-1N2_CRe indicate
that these Helitrons insert specifically downstream
from the 5'-TTTT-3' tetranucleotide, producing no
TSDs (Figure 4E). However, this non-TSD feature only
appears in CRe-1, 2, 3 insertions that terminate exactly
at the consensus 5'-ends, such as loci 2, 3, 8, 9 and 13 in
Figure 4A. Strikingly, most other insertions, especially
5'-truncated ones, are flanked by TSDs of variable length
(approximately 7 to 19 bp; Figure 4B). In some cases
much longer TSDs are observed (44, 50, 93, 242 and
443-bp long). Approximately 70% of CRe-1 (150 loci),
57% of CRe-2 (70 loci), and 10% of CRe-3 (35 loci) are
flanked by TSDs. This varying percentage probably reflects
different family ages, since CRe-1 is the youngest
family with elements approximately 98% identical to the
consensus. Interestingly, almost all of these 5'-TSDs are
located downstream from the same tetranucleotide as
observed in the Helitron-1N1_CRe or Helitron-1N2_CRe
insertions (TTTT, or T-rich tetranucleotides: TTTG,
TTTC, TCTT, TGTT), suggesting a common mechanism
involved at least in the target recognition process in
the Helitron and the three non-autonomous Fanzor1
families. In some individual CRe-1, 2, 3 insertions, short
extra sequences are present downstream of the 5'-TSDs
(loci 1 and 7, Figure 4A). The captured sequences
can occur upstream from the normal consensus 5'-termini
(locus 1, Figure 4A). Intriguingly, TSDs are
extremely rare in the cases of the non-autonomous
Helitron-1N1_CRe and Helitron-1N2_CRe elements. For
example, only one out of 200 Helitron-1N1_CRe elements
is flanked by TSDs. Elements of the two families
are 95 to 98% identical to their consensus sequences.
It is not clear whether the difference between the
three Fanzor1 elements and the two non-autonomous
Helitron elements is caused by the Fanzor1 protein or
by the relatively short length of the Helitron-1N1_CRe
elements (657 bp) or Helitron-1N2_CRe elements
(673 bp).
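
The TSD observations above can be illustrated with a small sketch that, given the sequence immediately upstream and downstream of an insertion, reports the longest perfect duplication and the four bases preceding it. This is a minimal illustration, not the procedure used in this study: the function names, the T_RICH set as a Python constant and the perfect-match criterion are assumptions, and the example flanks are read from the panel of Figure 4B (loci 1 and 12) as printed.

```python
# T-rich tetranucleotides listed in the text as preferred targets.
T_RICH = {"TTTT", "TTTG", "TTTC", "TCTT", "TGTT"}


def find_tsd(upstream, downstream, min_len=2):
    """Longest suffix of `upstream` identical to a prefix of `downstream`.

    ASSUMPTION: only perfect duplications are reported; imperfect TSDs
    (with mismatches) would need an alignment-based approach instead.
    """
    upstream, downstream = upstream.upper(), downstream.upper()
    longest = min(len(upstream), len(downstream))
    for k in range(longest, max(1, min_len) - 1, -1):
        if upstream[-k:] == downstream[:k]:
            return upstream[-k:]
    return ""


def target_tetranucleotide(upstream, tsd):
    """Four bases immediately 5' of the upstream TSD copy."""
    end = len(upstream) - len(tsd)
    return upstream.upper()[max(0, end - 4):end]


# Flanks of locus 1 and locus 12 as printed in Figure 4B.
for up, down in [("GCGTTTTTCTGGTCTGA", "TCTGGTCTGA"),
                 ("GCATTTGCACGATCGATGGCAGTACG", "CACGATCGATGGCAGTACG")]:
    tsd = find_tsd(up, down)
    tetra = target_tetranucleotide(up, tsd)
    print(len(tsd), tsd, tetra, tetra in T_RICH)
```

Because the longest match is taken, a chance extension of the duplication into the preceding tetranucleotide would be counted as part of the TSD, so boundary calls from such a sketch should be checked by eye.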

Features of Helitron2 elements 

CRe-1, 2, 3 and many other Helitron elements from different
species, such as Helitron-1_CRe, Helitron-2_CRe
and Helitron-5_SMo, display two distinct features at the
terminal regions. The first is a pair of short asymmetrical
terminal inverted repeats (ATIRs), located asymmetrically
at the ends: the 5'-ATIR is 0 to 2 bp away
from the 5'-end, and the 3'-ATIR is approximately 20
to 30 bp from the 3'-end, upstream of the hairpin
structure (Figure 5A, 5C). The second feature is the 5'-terminal
hairpin structure, involving a part or the whole
5'-ATIR sequence (Figure 5A, 5C). The two structural
features are assumed to be important for transposition.
In particular, compensatory base mutations were observed
in two related elements (that is, Helitron-1N1_CQu
and Helitron-1N2_CQu) to maintain such
features (Figure 5C). Possibly, during the ending phase
of rolling-circle replication, the pairing between the
5'-ATIR and 3'-ATIR destroys the 5'-hairpin structure,
and thus determines the replication endpoint. All
Helitrons with such features are significantly clustered in
one phylogenetic group, called Helitron2 in this paper,
whereas all Helitrons with the canonical structures constitute
a separate group (Helitron1), with elements lacking
the 5'-hairpin structure [2] (Figure 5A, 5B; see
Additional file 9). Nevertheless, both Helitron1 and
Figure 2 Phylogeny of Fanzor/TnpB proteins. Eukaryotes and eukaryotic viruses are colored as follows: dsDNA viruses (blue), metazoa (yellow),
fungi (green), chlorophyta (cyan), rhodophyta (red), stramenopiles (dark red), choanoflagellida (orange), and amoebozoa (pink). TnpB proteins are 
from the ISfinder database (ISXXX) or GenBank (with accession number). The tree is based on the alignment of a longer region including most of 
the N-terminal and the C-terminal portions (see Additional file 4). 
[Figure 3: schematics of the Mariner elements ROr-2, PUl-1, 2, PUl-4, and PUl-5, 6; PCa-1, with TIR alignments]

Mariner-1_OLpv 5' CACCTTTTTACATTTCAAACGCCGATTTTATAAG
Mariner-1_OLpv 3' CACCTTTTTACATTTCAAACGCCGATTTTAGTTC
HMa-1 5'          CACCTTTGCACATTTAAAACGCTGATTTTTTAAA
HMa-1 3'          CACCTTTGCACATTTAAAACGCCGACTTAACGAC

Figure 3 Fanzor proteins in Mariner elements. Transposable elements (TEs) are indicated by bars flanked by TA target site duplications (TSDs);
the undetermined ends are indicated by dashed lines (PUl-5, 6; PCa-1). The triangles at the element ends represent the terminal inverted repeat
(TIR) sequences. The inner arrows indicate the protein coding regions (dashed lines indicate the degenerated coding sequences). The alignments of the
5' and 3' TIR sequences are shown on the right.



Bao and Jurka Mobile DNA 2013, 4:12
http://www.mobilednajournal.com/content/4/1/12



[Figure 4: panels A, C, D and F (element schematics, 3'-end alignment and BLASTN summary) are graphical; the sequence examples of panels B and E follow]

B
1   GCGTTTTTCTGGTCTGA<->TCTGGTCTGA
4   GTGTTTTTGGTGAAGC<->TGGTGAAGC
5   CGTTTTTTAGTGCATC<->TAGTGCATC
6   TGATTTTTGCCCGTATGTA<->TGCCCGTATGTA
7   GCCTTTTCCCTTGATGGTG<->CCCTTGATGGTG
10  TCCTTTTCTCCAGCTGGTG<->CTCCAGCTGGTG
12  GCATTTGCACGATCGATGGCAGTACG<->CACGATCGATGGCAGTACG
14  GCGTTTTTTACCCCCCGC<->TTACCCCCCGC
15  ACATTTTCTAAACCAGCCCATAGCCG<->CTAATCCAGCCCATAGCCG

E
AACTGTTTTGCTTTT |Helitron-1N1_CRe| TGAATAATCCTTTTG
AACTGTTTTGCTTTT TGAATAATCCTTTTG
ABCN01001933.1 (17844-17160)

TACTGTTGCTTTTTT |Helitron-1N1_CRe| TGTTTTTCGATTGCT
TACTGTTGCTTTTTT TGTTTTTCGATTGCT
ABCN01002965.1 (18430-17746)

AGCACAACGTATTTT |Helitron-1N2_CRe| CCAACGTACTAGCCG
AGCACAACGTATTTT CCAACGTACTAGCCG
ABCN01001984.1 (12737-13439)

Figure 4 CRe-1, 2, 3 elements. (A) CRe-1, 2, 3 consensus sequences and the exons (black boxes). Dotted areas indicate the 5'-ends,
approximately 200-bp long, which are 98% identical to those of confirmed Helitrons. Asterisks at the 3'-ends indicate the short homologous
regions in CRe-1, 2, 3 and Helitron elements (C, D). The corresponding sequences of the 15 example loci (1 to 15) are indicated by solid lines
below. Dashed lines mark the internal deletion regions. Nine of them are flanked by target site duplications (TSDs) indicated by small diamonds.
Note that loci 1 and 7 include short segments of 'non-Fanzor' sequences (gray line) at the 5'-ends. The sequences of the 15 loci are shown in
Additional file 8B. (B) Examples of the nine TSD sequences (shaded). Note that the 5'-TSDs are immediately downstream of TTTT tetranucleotides.
(C) Helitron-1_CRe and non-autonomous Helitrons. (D) The alignment of the 3'-ends of Helitrons and CRe-1, 2, 3. The 3' asymmetrical terminal
inverted repeats (ATIRs) are boxed. (E) Target specificity of Helitron-1N1_CRe elements. They insert specifically between TTTT and T/C and produce
no TSDs. Helitron-1N2_CRe elements also insert after TTTT. Three examples of the pre- and post-insertion sites are shown. (F) The illustration of the
5'-truncation or 3'-overabundance in CRe-1 elements: graphical summary of an NCBI online BLASTN search of the Chlamydomonas reinhardtii
genome with the consensus of CRe-1.



Helitron2 elements have 3'-terminal hairpin structures,
and show similar 5'-end nucleotide preferences: TC in
Helitron1 and T in Helitron2. Based on these features, the CRe-1,
2, 3 elements are confirmed as Helitron2 transposons
(Figure 5C). It is worth noting that in some Helitron2
elements, such as Helitron-2_CRe and Helitron-1_DR, the
RepHel protein is in the opposite orientation relative to the
majority (Figure 5A, 5C).
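
Terminal hairpins of the kind described for Helitron2 elements are short inverted repeats separated by a loop, so a simple scan for a stem followed by its reverse complement can flag candidate positions. The sketch below is a hypothetical illustration rather than the detection method of any published program: it reports only perfect Watson-Crick stems, the stem and loop sizes are arbitrary assumptions, and the test sequence is a toy string, not a Helitron terminus.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_hairpins(seq, stem=5, min_loop=3, max_loop=8):
    """Return (start, stem_sequence, loop_sequence) for perfect stem-loops.

    ASSUMPTION: stem length and loop range are illustrative defaults.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        partner = revcomp(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop          # start of the right arm
            if j + stem > len(seq):
                break
            if seq[j:j + stem] == partner:
                hits.append((i, left, seq[i + stem:j]))
    return hits


# Toy sequence: a 6-bp stem (GGGCTC / GAGCCC) around a 4-nt loop.
toy = "TT" + "GGGCTC" + "AAAA" + "GAGCCC" + "TT"
print(find_hairpins(toy, stem=6))
```

Applied near both ends of a candidate element, a scan like this could help distinguish Helitron2-like termini (two hairpins) from Helitron1-like termini (3' hairpin only), subject to the simplifications stated above.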

Fanzor1 protein in IS4-type elements

IS4-type Tpase and Fanzor1 proteins are present in two
families, ESvi-1B and ESv-2 (Figure 6). The two families are
found in the genome of the brown alga Ectocarpus siliculosus and
the algal virus Ectocarpus siliculosus virus-1 (ESV1,
[GenBank:AF204951]), respectively. Other related elements, encoding
either Fanzor1 or IS4-type Tpase, such as ESvi-1A and
IS4_ESvi, are also found in the algal genome (Figure 6). All
these elements are single-copy in the genomes, flanked by
18-bp terminal inverted repeats (TIRs) similar to those of the
ISHch2 element, which is annotated as an IS4 family member in the
ISfinder database (Figure 6). ESvi-1B and ESv-2 elements
share approximately 1-kb-long 5'-terminal sequences coding
for IS4-type Tpase (78% sequence identity), but differ completely
in the other regions, where Fanzor1 proteins are
encoded. This situation is analogous to that between PGv-1
and Mariner-2_PGv elements described above (Figure 3).
Notably, although ESvi-1A, ESvi-1B and IS4_ESvi elements
were identified in the genome of the brown alga E. siliculosus,
they should be viewed as virus-borne elements ('vi' in each
name stands for 'virus integrated'). They are found in
two contig sequences ([GenBank:CABU01010405.1] and
[GenBank:CABU01010404.1]) that are approximately 84%
identical to the ESV1 virus genome ([GenBank:AF204951]),
likely representing large integrated virus fragments [29]. In addition,
there is another Fanzor family in the ESV1 genome,
ESv-1, probably associated with non-IS4 families. Individual
elements from the ESv-1 family are flanked by 2-bp TSDs
(TA) and variable TIRs.
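
Since TIR length is used throughout to characterize these elements, a brief sketch of how a perfect TIR can be measured may be useful: the 5' end of the element is compared, base by base, with the reverse complement of its 3' end. The function names and the toy element below are hypothetical, and the sketch assumes a perfect TIR starting at the very first base; imperfect or subterminal repeats would need an alignment with mismatches allowed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tir_length(element, max_len=50):
    """Length of the perfect terminal inverted repeat of `element`.

    ASSUMPTION: the TIR is perfect and begins at the first base.
    """
    seq = element.upper()
    rc = revcomp(seq)
    limit = min(max_len, len(seq) // 2)   # the two arms cannot overlap
    n = 0
    while n < limit and seq[n] == rc[n]:
        n += 1
    return n


# Toy element: an arbitrary 18-bp TIR around an arbitrary internal region.
tir = "GGCACTGTAGATTCAACC"
toy_element = tir + "ACGTTGCAGGAC" + revcomp(tir)
print(tir_length(toy_element))
```

The value of max_len only caps the search; it is not a biological limit.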

Fanzor1 protein in Sola2 elements

In the amoebozoan Dictyostelium fasciculatum, there are
three related Fanzor1 families (DFa-1, 2, 3) classified as
Sola2-type elements [30]. A putative 888-aa Sola2-type
Tpase is encoded by the DFa-2 elements (Figure 7A,
see Additional file 10). Moreover, the three families are
flanked by 12 or 13-bp TIRs and AT-rich 4-bp TSDs
(AWWT) (Figure 7A, see Additional file 11). The 4-bp
TSD feature is consistent with that of the Sola2 family [30].
In DFa-1 and DFa-3 elements most of the Sola2-Tpase
coding region is deleted. The three families are nearly
identical in the 5' regions (approximately 2.5 to 3 kb
from the 5'-end), but no sequence similarity was detected
in most other regions, where Fanzor1 proteins are encoded
(Figure 7A). Interestingly, such Sola2-Fanzor chimeric elements
also appear in the PPa-1, 4, 5 families in the amoebozoan species
Polysphondylium pallidum (Figure 7B, see Additional
file 10). Among them, the 5'-terminal 7-kb sequences are
nearly identical (98% identity), coding for Sola2 Tpases, but
the 3'-terminal sequences are entirely different. These
chimeric elements are flanked by short imperfect TIRs (21-bp),
and 4-bp AT-rich TSDs (that is, ATAT, AAAT, ATTT;
Figure 7B, see Additional file 11).
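
TSD consensus patterns such as AWWT are written with IUPAC ambiguity codes (W = A or T, N = any base). As a small illustration, the sketch below converts such a pattern into a regular expression and checks the TSDs quoted above against it. The helper names are hypothetical; the code table is the standard IUPAC nucleotide alphabet.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Compile an IUPAC consensus (for example 'AWWT') into a regex."""
    return re.compile("".join(IUPAC[base] for base in pattern.upper()))


def matches_consensus(pattern, seq):
    """True if the whole of `seq` fits the IUPAC consensus `pattern`."""
    return iupac_to_regex(pattern).fullmatch(seq.upper()) is not None


# TSDs reported for the PPa chimeric elements, checked against AWWT.
for tsd in ["ATAT", "AAAT", "ATTT"]:
    print(tsd, matches_consensus("AWWT", tsd))
```

The same helper would apply to the other consensus patterns mentioned in the text, such as TTAN.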

Fanzor1 protein in other transposable elements

Fanzor1 proteins were also found in DNA transposons
from other superfamilies. For example, in the genomes 
of fungi Rhizopus oryzae, Phycomyces blakesleeanus and 
Mucor circinelloides, ROr-4, PBl-3 and MCi-4 elements, 
respectively, appear to belong to the MuDr superfamily 
(see Additional file 12). While these elements do not en- 
code MuDR Tpase, all carry TIRs similar to those of 
confirmed MuDR elements (for example, MuDr-2_PBl) 
and are flanked by 9-bp TSDs. 

In the genomes of five insect-infecting viruses, five closely 
related Fanzor1 families, HVav-1 (Heliothis virescens ascovirus
3e), SFav-1 (Spodoptera frugiperda ascovirus 1a), PUgv-1
(Pseudaletia unipuncta granulovirus), HAgv-1 (Helicoverpa 
armigera granulovirus) and HAmn-1 (Helicoverpa armigera 
multiple nucleopolyhedrovirus), are flanked by 4-bp TSDs 
(TTAN) and 13-bp TIRs (see Additional file 13). However, 
they could not be assigned to any particular superfamily due 
to the lack of Tpase information. 
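
TSD consensus patterns such as TTAN or AWWT are written with IUPAC degenerate nucleotide codes. As a minimal sketch (the function name is hypothetical and this is not the authors' code), an observed TSD can be checked against such a pattern as follows:

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "R": "AG", "Y": "CT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_iupac(seq: str, pattern: str) -> bool:
    """Return True if `seq` has the same length as `pattern` and every
    base is allowed by the IUPAC code at the same position."""
    seq, pattern = seq.upper(), pattern.upper()
    if len(seq) != len(pattern):
        return False
    return all(base in IUPAC[code] for base, code in zip(seq, pattern))
```

For example, the TSDs ATAT and AAAT both satisfy AWWT, whereas ACAT does not.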

In the genome of the fungus Mucor circinelloides, the
MCi-2 family is unclassified due to its unusual features
(Figure 8A). A total of 11 MCi-2 copies (loci) are found
in the genome. They differ in the 5' regions (approximately
1 to 3 kb long), but are nearly identical in their
6-kb 3' regions (99% identity), where Fanzor proteins are
encoded. Based on their 5' variable regions, four subfamilies
were identified among ten loci (MCi-2A, 2B, 2C, and
2D), where each subfamily is represented by two or three
copies. The 11th locus is probably incomplete, and it is
represented by a single copy in the genome (Figure 8A).



The MCi-2A and MCi-2D subfamilies are represented
by three and two presumably complete copies, respectively.
They are flanked by 11 or 12-bp TSDs (Figure 8B),
but they lack recognizable TIRs. The TSDs show the
same pattern, ATAATTNNNN(N), implying that the
two subfamilies use the same mechanism of transposition,
though they have different 5'-end sequences. Notably,
although the MCi-2A subfamily contains a partial
coding sequence for a Crypton Tpase (approximately

Figure 5 Helitron2 subgroup and its terminal structure. (A) Structural features of Helitron1 and Helitron2 groups. Helitron1 has only a 3'-subterminal
hairpin structure [2]. The dark arrows in Helitron2 represent the asymmetric terminal inverted repeats (ATIRs). (B) Phylogeny of the
RepHel proteins encoded by Helitron1 and Helitron2 groups. The black square indicates Helitrons in which both termini are known. The alignment
of the RepHel proteins is shown in Additional file 9. (C) Terminal sequences from the selected examples of Helitron2 elements. The bases in bold
font represent the ATIRs. Pairing nucleotides in the hairpins are shaded in gray. The 5'-end sequences of CRe-1, 2, 3 are similar to that of
Helitron-1_CRe (not shown). Note that the RepHel protein is encoded in the opposite direction in Helitron-2_CRe and Helitron-1_DR (marked with #). The
compensatory mutations in the complementary segments are highlighted by asterisks at the bottom in the alignment of Helitron-1N1_CQu and
Helitron-1N2_CQu.

473-aa), it lacks approximately 200-aa at its C-terminus
when compared to other fungal Crypton Tpases (see
Additional file 14). It remains uncertain if the MCi-2
elements belong to the Crypton superfamily, because
Crypton elements have not been known to produce TSDs.
Moreover, it is unclear whether the 5' fuzzy ends of MCi-2A
and MCi-2D result from incomplete transposition/duplication
or if there are other reasons (Figure 8B).
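
As a minimal sketch of how TSDs such as those flanking MCi-2A and MCi-2D can be recognized, the function below looks for the longest direct repeat that ends the upstream flank and begins the downstream flank of an insertion. The function name, the default length bounds and the exact-match requirement are assumptions made for illustration, not the procedure used in this study.

```python
def longest_tsd(left_flank: str, right_flank: str,
                min_len: int = 2, max_len: int = 20) -> str:
    """Return the longest sequence of length min_len..max_len that both
    ends `left_flank` and begins `right_flank`, or "" if there is none.

    `left_flank` is the host sequence immediately upstream of the element
    and `right_flank` the host sequence immediately downstream.
    """
    left, right = left_flank.upper(), right_flank.upper()
    longest_possible = min(max_len, len(left), len(right))
    for k in range(longest_possible, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return ""
```

Very short matches are expected by chance, so candidates returned near `min_len` need to be confirmed across several copies of the same family.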

As in the case of MCi-2, the classification of the MCi-5
family is also unknown. Five MCi-5 elements (loci) were
identified in the M. circinelloides genome, three of which
(Locus-1, 2, 3) appear to be complete elements, flanked
by putative 6-bp TSDs (ATTTAT), while no significant
TIRs were detected (Figure 8C). Interestingly, a
Harbinger-type Tpase (2 exons) is encoded by three MCi-5
elements (Locus-1, 2, 4; Figure 8C, see Additional file 15).
It is unclear whether the Harbinger Tpases are involved in
the transposition of MCi-5 elements because, in contrast
to typical Harbinger elements, MCi-5 elements lack
any obvious TIRs, and the potential TSDs (ATTTAT) are
not 2 or 3 bp long as in typical Harbinger elements [31].

In the genome of the red alga Cyanidioschyzon merolae,
approximately 150 copies of CMe-1A elements are found,
each approximately 80% identical to the consensus. The
complete consensus is around 3 kb long,
but the TSDs could not be determined, probably due to
high diversity. Interestingly, the 5' 635 bp of CMe-1A are
95% identical to the entire sequence of another transposable
element, TE-N2_CMe, which is represented
by approximately 70 copies in the genome (Figure 8D).
Both CMe-1A and TE-N2_CMe elements lack TIRs and
their TE classification is unknown.
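
Identity values such as "80% identical to the consensus" depend on how alignment gaps are counted, and the text does not state the convention used. The sketch below assumes one common choice, identity over columns in which neither sequence has a gap; the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Assumption: columns in which either sequence has a gap ('-') are
    skipped, so identity is computed over ungapped columns only.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared
```

Counting gap columns as mismatches instead would give lower values for the same alignment.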

Fanzor2 proteins in IS607-like elements

Apart from the Fanzor2 proteins, only TnpA_IS607-like
serine recombinases (SR) could be found in some
Fanzor2 elements, such as ACa-1, -2, CRv-1, ISvMimi_1,
ISvMimi_2, ISvAR158_1, and ISvNY2A_1 (Figure 8E, see
Additional file 1). In the bacterial IS elements that
co-cluster with Fanzor2 elements, only TnpA_IS607-like
serine recombinases (SR) were found, such as in
ISArma1 (Figure 2). None of these elements has TIRs or
TSDs, suggesting that Fanzor2 and these IS elements might
have a common origin.

Discussion 

The mysterious role of Fanzor/TnpB in transposition 

Prokaryotic TnpB proteins are encoded by bacterial 
transposable elements of IS200/605 or IS607 family. 
Here we report two groups of TnpB homologues
(Fanzor1 and Fanzor2) encoded by diverse transposable
elements from different eukaryotic species, as well as
from some large DNA viruses that infect eukaryotes. Fanzor
and TnpB proteins are functionally uncharacterized, but
they share the same set of extremely conserved motifs in
their C-terminal halves: D-X(125,275)-[TS]-[TS]-X-X-[C4
zinc finger]-X(5,50)-RD (Figure 1). While Fanzor2 proteins,
which are also encoded by IS607-like elements, are closer to
prokaryotic TnpB, Fanzor1 proteins are encoded by diverse TEs
and are more distantly related to TnpB than the Fanzor2
proteins (Figure 2).
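
The conserved motif can be written as a regular expression for a rough scan of candidate proteins. This is a sketch under a stated assumption: the text does not give the cysteine spacing of the C4 zinc finger, so it is approximated here as C-X(2,4)-C-X(5,40)-C-X(2,4)-C. That spacing and the function name are illustrative choices, not values from this study.

```python
import re

# ASSUMPTION: the C4 zinc finger spacing below is illustrative only;
# the source names the element "[C4 zinc finger]" without spacings.
C4_FINGER = r"C.{2,4}C.{5,40}C.{2,4}C"

# D-X(125,275)-[TS]-[TS]-X-X-[C4 zinc finger]-X(5,50)-RD
FANZOR_MOTIF = re.compile(
    r"D.{125,275}[TS][TS].." + C4_FINGER + r".{5,50}RD"
)


def find_motif(protein: str):
    """Return (start, end) of the first motif match in a protein
    sequence (one-letter codes), or None if the motif is absent."""
    match = FANZOR_MOTIF.search(protein.upper())
    return (match.start(), match.end()) if match else None
```

A pattern of this kind is only a coarse filter; profile-based searches such as PSI-Blast remain the appropriate tool for detecting distant homologues.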

TnpB/Fanzor proteins are not DDE-type Tpases. Why
are they so frequently found in various transposons?
Could Fanzor/TnpB represent a novel type of Tpase that
can propagate a DNA element on its own? This possibility
can be ruled out in the IS200/605 and IS607 families, where
a tyrosine recombinase or serine recombinase (TnpA) is
known to be the functional Tpase, and TnpB proteins
appear to be dispensable for transposition [14-16,32].
Alternatively, could TnpB/Fanzor represent a captured
passenger gene with functions irrelevant to the
transposition process, like antibiotic resistance genes?
This is also unlikely, because such passenger genes would be
expected in many different types of IS elements, rather than only in
the IS200/605 and IS607 families from bacterial genomes.

In a third scenario, TnpB/Fanzor proteins may function
as regulatory proteins in an as yet unknown transposition
process in vivo. In fact, the complexity of the transposition
process has been studied in the Tn7 transposon, which
encodes five proteins, all of which are involved in
transposition. The proteins are: TnsA (type II restriction

Figure 6 Fanzor1 protein in IS4 elements. IS4-type elements in Ectocarpus siliculosus and Ectocarpus siliculosus virus-1. The alignment of
terminal inverted repeats (TIRs) is shown on the right. The 8-bp perfect TSDs flanking ESvi-1B are indicated by diamonds. Note that the ESv-2
element is named as ISvEsV1_1 in the ISfinder database, where the encoded Fanzor1 protein is annotated as a passenger protein of

endonuclease), TnsB (DDE Tpase), TnsC (a regulator
between TnsAB and TnsD or TnsE), TnsD (directing
transposition to attTn7 sites) and TnsE (directing
transposition to non-attTn7 sites) [33]. Other
transposon-encoded non-Tpase proteins potentially involved in
transposition were also reported recently by Kapitonov
et al. [34]. They include the SNF2 helicase in Inton and
Enton, the DEDDh nuclease in P and piggyBac, and the RecQ
helicase in Academ. It is worth noting that Fanzor/TnpB
proteins contain some DNA-binding domains: a
zinc-finger-like domain near the C-terminus, and an N-terminal HTH
domain in TnpB and Fanzor2 (and probably in the Fanzor1
proteins as well; Figure 1), suggesting their involvement in
the transposition process.

The presumed function in transposition is also suggested
by an example of an old Fanzor1 family, CMe-1A
(Figure 8D). CMe-1A elements are approximately 80%
identical to the family consensus, but some individual
CMe-1A elements still encode intact Fanzor1 proteins.
This long-lasting coding capability would seem unusual
for a "non-autonomous" family (CMe-1A) if no function
were associated with the Fanzor1 protein. Analogous cases
exist in the so-called HAL1 "non-autonomous" families
derived from the L1 non-LTR retrotransposons, which
encode only the first open reading frame protein (ORF1p),
instead of both ORF1p and ORF2p [35]. ORF1p has
RNA-binding [36] and nucleic acid chaperone activity [37],
whereas ORF2p is the major Tpase, with endonuclease (EN) and
reverse transcriptase (RT) activities. In the guinea pig genome
the coding capacity of ORF1p in the HAL1
retrotransposons has been maintained for a relatively
long time (approximately 29 to 44 Myr) [35], implying
that both the cis-encoded ORF1p and trans-encoded
ORF2p are required for transposition of HAL1 elements.

Comparison of three virus-integrated Mariner transposons,
PGv-1, Mariner-2_PGv and Mariner-1_OLpv
(Figure 3), may provide some clues regarding the potential
function of the TnpB/Fanzor protein. Each Mariner
element encodes three proteins showing some functional
parallels: Tpase, endonuclease, and methyltransferase in
Mariner-2_PGv and Mariner-1_OLpv, or Tpase, endonuclease
and Fanzor in PGv-1. In bacteria, methyltransferases
and restriction endonucleases constitute the
restriction-modification system important in many cellular processes.
Therefore, it is interesting to see that both endonuclease
and methyltransferase are encoded by some transposons
(Mariner-2_PGv and Mariner-1_OLpv). To our knowledge,
the presence of methyltransferase in transposons has
not been reported before. The potential role of the
transposon-encoded methyltransferases in transposition
remains largely unknown. Normally, DNA methylation is
essential for inhibiting the expression and transposition of
TEs [38,39]. For example, methylation in the terminal sequence
of transposons can prevent binding of transposase
[40,41]. Theoretically, methylation may also protect the
DNA in the transposome from cutting by restriction enzymes,
especially in bacterial cells. Moreover, it was reported that
deoxycytosine methylase (Dcm) and EcoRII methylase
could increase the Tn3 transposition frequency in E. coli
[42]. There are other circumstantial data consistent with
this methyltransferase hypothesis. First, while the vast
majority of TnpB proteins are annotated as transposases in
the NCBI database, a handful of them are indeed annotated
as DNA (cytosine-5-)-methyltransferases (for example, 
[GenBank:YP_001645687.1]). However, the basis for this 
annotation is not documented. Second, GipA ([GenBank: 
AAF98319.1]) is a TnpB-like protein encoded by an IS 
element carried by the lambdoid phage Gifsy-1. GipA has 
been shown to be a virulence gene in Salmonella enterica 
[32]. Analogously, DNA adenine methylase (Dam) is known 
as an important factor in bacterial virulence [43-45]. The 
above observations are consistent with the possibility that
Fanzor proteins could be methyltransferases.

Fanzor elements in viruses 

In the current dataset, 18 different large dsDNA
eukaryotic viruses were found to carry Fanzor elements
(Table 1). By comparison, only 24 eukaryotic species were
found to carry Fanzor elements. The high proportion of viral
carriers is unexpected given the relatively small genomes of these
viruses. However, it may be partly explained by the possibility that
the Fanzor protein assumes the same role in both viral
infection and TE transposition. In a sense, both viruses
and DNA TEs are selfish or parasitic episomes.

In the phylogenetic tree, the viral Fanzor proteins are 
intermingled with non-viral eukaryotic Fanzor proteins 
(Figure 2). This suggests that these large-genome viruses 
may play an extensive role in spreading Fanzor genes (or 
other TEs) among eukaryotes. Among currently
sequenced metazoan species, only one insect species, the
Hessian fly (M. destructor), was found to carry Fanzor
elements. The HMa-1 element in H. magnipapillata
probably also came originally from a virus genome. All
13 Fanzor families in the M. destructor genome
significantly co-cluster with 5 viral Fanzor families,
including HAgv-1, SFav-1, PUgv-1, HAmv-1 and HVav-1
(PUgv-1, HAmv-1 and HVav-1 are not included in
Figure 2). These viruses all infect insects,
suggesting that they may participate in spreading Fanzor
elements. Interestingly, the genomes of Heliothis
virescens ascovirus 3e (HVav, [GenBank:EF133465]) and
Helicoverpa armigera multiple nucleopolyhedrovirus
(HAmn, [GenBank:EU730893]) share no overall sequence
similarity, but each of them contains one
copy of a Fanzor element, HVav-1 and HAmn-1, respectively,
which are 88% identical to each other over their entire length.
Notably, the two viruses infect insect species of the same
family, Noctuidae. Finally, the genomes of Phaeocystis globosa virus
12T (PGv, [GenBank:HQ634147]) and Organic Lake
phycodnavirus 1 (OLpv-1, [GenBank:HQ704802.1])
share no overall sequence similarity, except
for the Mariner-2_PGv and Mariner-1_OLpv elements in
their genomes, respectively, which are 79% identical in
their 5'-terminal regions (Figure 3). Both viruses infect
phototrophic marine algae: PGv infects Phaeocystis
globosa and OLpv-1 probably infects the prasinophyte
Pyramimonas [46].

Fanzor proteins are often found in chimeric elements
represented by the following 4 sets of TEs: (1) PGv-1,
Mariner-2_PGv, Mariner-1_OLpv and HMa-1 (Figure 3);
(2) ESvi-1B and ESv-2 (Figure 6); (3) DFa-1, DFa-2 and
DFa-3 (Figure 7A); (4) PPa-1, PPa-4 and PPa-5
(Figure 7B). The first two sets are from virus genomes.
The latter two sets of elements are present in
two related slime mold species: D. fasciculatum and P.
pallidum. These chimeric Fanzor elements probably also
originated with the involvement of viruses.

Conclusions 

Fanzor and TnpB are homologous proteins. Hypothetically,
they may function as methyltransferases. Eukaryotic
Fanzor proteins are associated with many diverse
eukaryotic viruses. The relatively small number of
Fanzor elements in eukaryotes probably reflects the fact
that they were transferred by viruses relatively recently. A
more frequent horizontal transfer in bacteria may account
for the more common presence of the TnpB proteins in
diverse bacteria and phages [47,48]. The two clades of
Fanzor elements (Fanzor1 and Fanzor2) might have originated
from two independent transfers from bacteria to
eukaryotes.

Methods 

Transposons were automatically detected using custom-made
scripts based on the methods described before [49].
Consensus sequences of each family were constructed
whenever possible. Potentially new TE proteins encoded by
long ORFs were screened out by TblastN against the Rebase
database [50]. The PSI-Blast and TBLASTN screening for
homologous proteins was done against all available sequence
databases at the National Center for Biotechnology
Information (NCBI) and at the Department of Energy Joint
Genome Institute (JGI). To detect all distantly related
eukaryotic proteins, multiple rounds of PSI-Blast were
performed until no more new significant scores were
detected. Each newly detected eukaryotic protein was used
as a query to repeat this procedure. In addition to the NCBI
databases, the following genome sequences were downloaded
from the JGI: Phycomyces blakesleeanus NRRL1555
and Mucor circinelloides
(http://genome.jgi-psf.org/Phybl2/Phybl2.download.ftp.html,
http://genome.jgi-psf.org/Mucci2/Mucci2.download.ftp.html).
The TE-encoded multiple-exon genes were predicted by the FGENESH program
(http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind),
and confirmed or refined with expressed
sequence tag (EST) information whenever possible. Functional
motifs in these proteins were identified by searching
against the Conserved Domain Database (CDD)
(http://www.ncbi.nlm.nih.gov/cdd/). Multiple protein sequences were
aligned with the online MAFFT web server (v6.861b;
http://mafft.cbrc.jp/alignment/software/) [51]. Sequence
phylogenies were obtained using PhyML (v3) [52], available at the
Phylogeny.fr web server (http://www.phylogeny.fr/) [53], and
the phylogenetic tree was rendered by MEGA4 [54]. The TE DNA
sequences and the protein sequences they encode are listed
in Additional file 2 and Additional file 3.
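
The exhaustive search described above, in which each newly detected protein is used as a query until no new hits appear, amounts to computing a closure over search results. A minimal sketch follows. The `search` callable is a hypothetical stand-in for one converged PSI-Blast run that returns identifiers of significant hits; it does not correspond to any real BLAST API.

```python
from collections import deque
from typing import Callable, Iterable, Set


def exhaustive_homolog_search(
    seeds: Iterable[str],
    search: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Repeat `search` with every newly found protein as the query
    until no new proteins are detected; return all proteins found.

    `search` is a placeholder supplied by the caller (for example, a
    wrapper around a PSI-Blast run); it is not a real library function.
    """
    found = set(seeds)
    queue = deque(found)
    while queue:
        query = queue.popleft()
        for hit in search(query):
            if hit not in found:
                found.add(hit)
                queue.append(hit)
    return found
```

Because each protein is queried only once, the loop terminates as soon as a full pass adds no new identifiers, which mirrors the stopping rule stated in the text.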